Predicting Two Income Groups for Marketing Purposes from U.S. Census Bureau Data

Understanding the Data

Problem Statement

As a data scientist, you are tasked by your retail business client with identifying two groups of people for marketing purposes: people who earn an income of less than $50,000 and those who earn more than $50,000. To assist in this pursuit, Walmart has developed a means of accessing 40 different demographic and employment-related variables for any person they are interested in marketing to. Additionally, Walmart has compiled a dataset that provides gold labels for a variety of observations of these 40 variables within the population. Using the given dataset, train and validate a classifier that predicts this outcome.

Our goal is to develop a predictive model, using the provided dataset, that determines whether the income of a person in the United States is above or below $50,000.

It is clear from the problem statement that this is a classification problem. Let's have a look at the target variable.
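
A quick way to inspect a binary target is to count how often each class occurs. The sketch below uses a toy DataFrame in place of the real data; the column name `income` and the label strings `<=50K` and `>50K` are assumptions made for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Toy stand-in for the census data. The column name "income" and its two
# label strings are hypothetical, chosen only for this illustration.
df = pd.DataFrame({
    "age": [25, 47, 8, 61, 33, 52],
    "income": ["<=50K", ">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# Absolute and relative frequency of each class in the target variable.
class_counts = df["income"].value_counts()
class_share = df["income"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

If one class clearly dominates, plain accuracy becomes a misleading metric, and some form of rebalancing is worth considering.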

Data Cleaning

Redundancy in Data

Missing Values

The Hispanic origin is modified such that

Garbage Values

If the people have

then it is likely that they might not be working at the time

Label Encoding
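
As a minimal sketch of label encoding, each categorical column can be mapped to integer codes with pandas. The column names and values below are hypothetical, and this is one possible approach rather than necessarily the one used in this project.

```python
import pandas as pd

# Hypothetical categorical and numeric columns, for illustration only.
df = pd.DataFrame({
    "education": ["Bachelors", "High school", "Children", "High school"],
    "sex": ["Female", "Male", "Male", "Female"],
    "age": [34, 51, 9, 45],
})

categorical_cols = ["education", "sex"]  # assumed list of columns to encode

encoded = df.copy()
mappings = {}
for col in categorical_cols:
    as_category = encoded[col].astype("category")
    # Keep the code -> label mapping so the encoding can be reversed later.
    mappings[col] = dict(enumerate(as_category.cat.categories))
    encoded[col] = as_category.cat.codes

print(encoded)
print(mappings)
```

Integer codes imply an ordering that the categories may not have. Tree-based models usually tolerate this, whereas linear or distance-based models may be better served by one-hot encoding.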

There is high correlation between

Undersampling
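
Random undersampling balances the classes by drawing, from every class, only as many rows as the smallest class contains. The sketch below shows the idea on toy data; the column name `income`, the label strings, and the random seed are assumptions, and the project may have used a different undersampling method.

```python
import pandas as pd

# Toy imbalanced data: 8 rows of the majority class, 2 of the minority class.
# The column name "income" and its labels are hypothetical.
df = pd.DataFrame({
    "age": list(range(20, 30)),
    "income": ["<=50K"] * 8 + [">50K"] * 2,
})

# Size of the smallest class; every class is reduced to this many rows.
minority_size = int(df["income"].value_counts().min())

balanced = (
    df.groupby("income")
      .sample(n=minority_size, random_state=42)  # draw per class
      .sample(frac=1, random_state=42)           # shuffle the rows
      .reset_index(drop=True)
)

print(balanced["income"].value_counts())
```

Undersampling should be applied to the training split only, so that validation data keeps the original class distribution.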

Segmentation Model with K-means

The clustering results showed that the three clusters had distinct characteristics. The first cluster consisted mainly of children, never-married individuals, and tax non-filers with an income below $50,000. The second cluster included working-age individuals (18-60 years) who work in the private sector or attend high school, whose spouse is present if they are married, and who are mostly not Hispanic. The third cluster was dominated by members of the armed forces and unemployed individuals, as well as people who worked fewer weeks per year. The clusters were similar in terms of citizenship and race.
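
A three-cluster segmentation of this kind can be sketched with scikit-learn's `KMeans`. The data below is synthetic, and its two features (a hypothetical "age" and "weeks worked per year") are assumptions standing in for the encoded census variables; the scaling step and seed are likewise illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data with two hypothetical features:
# column 0 = age, column 1 = weeks worked per year.
X = np.vstack([
    rng.normal([10, 0], [3, 1], size=(50, 2)),
    rng.normal([40, 50], [8, 2], size=(50, 2)),
    rng.normal([35, 10], [8, 4], size=(50, 2)),
])

# K-means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Profile each cluster by its size and feature means in original units.
for k in range(3):
    members = X[labels == k]
    print(k, len(members), members.mean(axis=0).round(1))
```

Comparing per-cluster means (or category frequencies for encoded variables) is what makes it possible to describe each segment in words.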